Skip to content

Conversation

@huan233usc
Copy link
Collaborator

@huan233usc huan233usc commented Feb 1, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Incremental CRC loading for P&M was added in #4077.

For snapshot N, we try to read N.crc, if N.crc missing, we list CRC files and find the one up to N-100. These two file listing operation could be actually avoided, because we have already listed files under _delta_log folder.

This PR collects latest CRC in the first file listing and remove two unnecessary IO.

How was this patch tested?

LogReplay test

Does this PR introduce any user-facing changes?

No

@huan233usc huan233usc self-assigned this Feb 3, 2025
@huan233usc huan233usc marked this pull request as ready for review February 13, 2025 18:25

// Must be final to be used in lambda
final AtomicBoolean hasReturnedAnElement = new AtomicBoolean(false);
final AtomicBoolean hasReturnedlogOrCheckPoint = new AtomicBoolean(false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasReturnedCommitOrCheckpoint

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a comment here. Something like: This variable is used to help determine if we should throw an error if the table history is not reconstructable. Only commit and checkpoint files are applicable.

}

hasReturnedAnElement.set(true);
// Only log and checkpoint could use to construct table state.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can delete this comment line

newVersion,
deltasAfterCheckpoint,
latestCompleteCheckpointFileStatuses,
listedChecksumFileStatues.stream()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it just be the last one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, updated

+ list.stream().map(FileStatus::toString).collect(Collectors.joining(",\n "));
}

public Optional<FileStatus> getLatestChecksum() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I think these needs method docs? This checksum might not be for the version that this LogSegment represents, right?

also ... suppose we have 10.checkpoint and 11.json and 12.json ...

what are the possible ranges of the version for which this CRC could be for? Could we have 8.crc? what are the guarantees of this version range?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For logSegment, we just have 8.crc, downstream(LogReplay) decides use it or not.

@huan233usc huan233usc requested a review from scottsand-db April 28, 2025 17:43
/** Returns the crc info for the current snapshot if the checksum file is read */
public Optional<CRCInfo> getCurrentCrcInfo() {
return currentCrcInfo;
return cachedCrcInfo.filter(crc -> crc.getVersion() == logSegment.getVersion());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also -- to clarify: the consumers of this API only ever want the CRCInfo if it was read during log replay? They never just want the CRCINfo for the latest version, even if it wasn't read during log replay? That semantic seems a bit odd to me -- can you please clarify / explain?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the consumer just try to get crcInfo of the current version. We return if it is cached. Updating the cache is not necessary during log replay's construction.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the consumer just try to get crcInfo of the current version.

But this API will only return it if it was read / cached?

My confusion is this:

  • it seems like consumers want the CRC info for the current version. full stop. if it exists, they want it
  • this API returns it if and only if it exists and we had to read it to update the snapshot hint

So, there seems to be a gap?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the concern here is --- when version hint is used, then crcInfo will always empty. This was unfortunately the behavior before, although problematic.

I updated with a fix. (Previously it will cause CRC missing when snapshot hint is used, now it is getting worse as we load domain metadata, so agree to fix it along the way.)

@huan233usc huan233usc requested a review from scottsand-db April 28, 2025 23:57
@vkorukanti vkorukanti merged commit 33e8666 into delta-io:master May 2, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants